Our project

We decided to focus on central Colombia, mainly because it contains the capital.

We built a model for the number of confirmed cases using all the other covariates (plus some we created), and we estimated the predictive accuracy of our selected model.

We decided to consider as central Colombia the following departments/districts: Bogotá D.C., Boyacá, Tolima, Cundinamarca, Meta, Quindío, Cauca, Valle del Cauca, Risaralda, Caldas, Antioquia, Santander, Casanare.

Loading the dataset

This dataset is completely missing the department Valle del Cauca!

colombia_covid <- as.data.frame(read_csv("data/covid19co.csv"))
cols <- colnames(colombia_covid)[c(1, 4, 5, 6, 7, 8, 9, 11, 14)]
colombia_covid <- colombia_covid[cols]
colombia_covid <- colombia_covid[, c(1, 9, 2, 3, 4, 5, 6, 7, 8)]
colnames(colombia_covid) <- c("ID de caso", "Fecha de diagnóstico", "Ciudad de ubicación", "Departamento o Distrito", "Atención", "Edad" , "Sexo", "Tipo", "País de procedencia")
#colombia_covid$`Departamento o Distrito`[which(colombia_covid$`Departamento o Distrito` == "Valle Del Cauca")] <- "Valle del Cauca"
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca",
    "Valle del Cauca", "Risaralda", "Caldas", "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]
colombia_covid <- colombia_covid[-which(colombia_covid$`Fecha de diagnóstico`== "-"), ]
head(colombia_covid, 5)
##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           06/03/2020         Bogotá D.C.             Bogotá D.C.
## 3          3           09/03/2020            Medellín               Antioquia
## 4          4           11/03/2020            Medellín               Antioquia
## 5          5           11/03/2020            Medellín               Antioquia
## 6          6           11/03/2020              Itagüí               Antioquia
##     Atención Edad Sexo        Tipo País de procedencia
## 1 Recuperado   19    F   Importado              Italia
## 3 Recuperado   50    F   Importado              España
## 4 Recuperado   55    M Relacionado                 Nan
## 5 Recuperado   25    M Relacionado                 Nan
## 6 Recuperado   27    F Relacionado                 Nan

Description of variables

  • ID de caso: ID of the confirmed case.

  • Fecha de diagnóstico: Date in which the disease was diagnosed.

  • Ciudad de ubicación: City where the case was diagnosed.

  • Departamento o Distrito: Department or district where the city belongs to.

  • Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.

  • Edad: Age of the confirmed case.

  • Sexo: Sex of the confirmed case.

  • Tipo: How the person got infected: in Colombia, abroad or unknown.

  • País de procedencia: Country of origin if the person got infected abroad.

Map

Here we can see our selected cities. The color of the pins reflects the number of cases: fewer than \(10\) cases gives a green pin, fewer than \(100\) an orange one, and otherwise the pin is red.
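The color rule can be captured by a small helper like the following (a sketch; the function name is ours, and the actual map code that places the pins is not shown here):

```r
# Map a case count to a pin color: <10 green, <100 orange, otherwise red
pin_colour <- function(n_cases) {
  if (n_cases < 10) "green"
  else if (n_cases < 100) "orange"
  else "red"
}

pin_colour(3)    # "green"
pin_colour(42)   # "orange"
pin_colour(250)  # "red"
```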

Preprocessing

We had to clean the dataset:

  • We transformed the Fecha de diagnóstico variable into a variable of type Date,

  • we fixed the variable ID de caso (since we removed some departments, and thus some rows, the IDs were no longer consecutive),

  • we created a variable Grupo de edad,

  • we cleaned the column País de procedencia (replacing cities with their country) and created the variable Continente de procedencia (since the former is too fragmented, we decided to work with continents).
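The cleaning steps above can be sketched as follows (a sketch with column names as in our dataset; the age breakpoints are our assumption based on the labels shown below, and the recoding of País de procedencia into countries and continents is omitted):

```r
# Sketch of the preprocessing: date parsing, ID renumbering, age binning
clean_covid <- function(df) {
  # 1. Parse the diagnosis date (stored as dd/mm/yyyy)
  df$`Fecha de diagnóstico` <- as.Date(df$`Fecha de diagnóstico`, format = "%d/%m/%Y")
  # 2. Renumber the case IDs so they are consecutive again
  df$`ID de caso` <- seq_len(nrow(df))
  # 3. Bin ages into the groups used in the report
  df$`Grupo de edad` <- cut(df$Edad,
                            breaks = c(0, 18, 30, 45, 60, 75, Inf),
                            labels = c("0_18", "19_30", "31_45", "46_60", "60_75", "76+"))
  df
}
```

Applied as `colombia_covid <- clean_covid(colombia_covid)`.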

##    ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1           1           2020-03-06         Bogotá D.C.             Bogotá D.C.
## 2           2           2020-03-09            Medellín               Antioquia
## 3           3           2020-03-11            Medellín               Antioquia
## 4           4           2020-03-11            Medellín               Antioquia
## 5           5           2020-03-11              Itagüí               Antioquia
## 6           6           2020-03-11         Bogotá D.C.             Bogotá D.C.
## 7           7           2020-03-11         Bogotá D.C.             Bogotá D.C.
## 8           8           2020-03-12         Bogotá D.C.             Bogotá D.C.
## 9           9           2020-03-12         Bogotá D.C.             Bogotá D.C.
## 10         10           2020-03-13       Villavicencio                    Meta
##      Atención Edad Sexo        Tipo País de procedencia Grupo de edad
## 1  Recuperado   19    F   Importado              Italia         19_30
## 2  Recuperado   50    F   Importado              España         46_60
## 3  Recuperado   55    M Relacionado                 Nan         46_60
## 4  Recuperado   25    M Relacionado                 Nan         19_30
## 5  Recuperado   27    F Relacionado                 Nan         19_30
## 6  Recuperado   22    F   Importado              España         19_30
## 7  Recuperado   28    F   Importado              España         19_30
## 8  Recuperado   36    F   Importado              España         31_45
## 9  Recuperado   42    F   Importado              España         31_45
## 10 Recuperado   30    F   Importado              España         19_30

New dataset I

##          Date Elapsed time New cases/day Cumulative cases
## 1  2020-03-06            0             1                1
## 2  2020-03-09            3             1                2
## 3  2020-03-11            5             5                7
## 4  2020-03-12            6             2                9
## 5  2020-03-13            7             3               12
## 6  2020-03-14            8            14               26
## 7  2020-03-15            9            13               39
## 8  2020-03-16           10             8               47
## 9  2020-03-17           11            13               60
## 10 2020-03-18           12             9               69

New dataset II

##          Date Elapsed time Department Department ID New cases/day
## 1  2020-03-09            3  Antioquia             1             1
## 2  2020-03-11            5  Antioquia             1             3
## 3  2020-03-14            8  Antioquia             1             3
## 4  2020-03-15            9  Antioquia             1             1
## 5  2020-03-19           13  Antioquia             1             3
## 6  2020-03-20           14  Antioquia             1            11
## 7  2020-03-21           15  Antioquia             1             3
## 8  2020-03-22           16  Antioquia             1             5
## 9  2020-03-23           17  Antioquia             1            22
## 10 2020-03-25           19  Antioquia             1             8
##    Cumulative cases/Department Mean age
## 1                            1 50.00000
## 2                            4 35.66667
## 3                            7 30.00000
## 4                            8 55.00000
## 5                           11 52.33333
## 6                           22 39.81818
## 7                           25 31.00000
## 8                           30 45.40000
## 9                           52 36.45455
## 10                          60 29.75000

Exploring the dataset

Some scattered information about the pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):

  • The quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that quarantine effects are not yet visible in our data.

  • On the 26th of March the machine that prepared the samples for processing and subsequent diagnosis of COVID-19 was damaged, which affected the speed at which results were produced. This could explain the very low number of confirmed cases in the following days.




The previous plot shows the daily incidence of the disease across all the departments we are considering.

Frequency of cases across Departments


Most of the cases are in the capital, Bogotá.

Cumulative number of confirmed cases


Here the growth looks exponential, which is consistent with the fact that we are studying the early stages of the outbreak.

Variables plots

Spread of the disease across genders

brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))

ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +
  geom_bar(data = subset(colombia_covid, Sexo == "F")) +
  geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) +
  scale_y_continuous(breaks = brks, labels = lbls) +
  coord_flip() +
  labs(title="Spread of the disease across genders",
       y = "Number of cases",
       x = "Department",
       fill = "Gender") +
  theme_tufte() +
  theme(plot.title = element_text(hjust = .5),
        axis.ticks = element_blank(),
        axis.text.x = element_blank()) +
  scale_fill_brewer(palette = "Dark2")  # "Dark3" is not an RColorBrewer palette

The disease (number of cases) is roughly equally distributed across genders.

Distribution of the disease across ages

#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- colombia_covid %>% 
  group_by(`Grupo de edad`) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(`Grupo de edad`))
age_groups_pie$label <- scales::percent(age_groups_pie$per)

age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(`Grupo de edad`))) + 
  geom_bar(stat="identity", width = 1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="Age groups", 
       x=NULL, 
       y=NULL, 
       title="Distribution of the disease across ages") +
  coord_polar(theta = "y") +
  geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
  guides(fill = guide_legend(title = "Group"))
  
age_pie 

People from 31 to 45 years old are the most affected by the disease and people over 76 years old are the least affected. Colombia is a very young country. In 2018 the median age of the population was 30.4 years old and the largest age group is made of people from 25 to 54 years old, which comprises 41.98% of the population. (https://www.indexmundi.com/colombia/demographics_profile.html)

Age-Sex plot


There isn’t much difference between the sexes within the different age groups.

Daily number per Tipo

theme_set(theme_classic())

ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_brewer(palette = "Set3") +
  geom_bar(aes(fill=Tipo), width = 0.8) +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across type",
       x = "Date of confirmation",
       fill = "Type")

Tipo plot

We think that en estudio means that it is not yet clear whether the case is imported or not. It may seem that there are more imported cases, so we count them:

type_pie <- colombia_covid %>% 
  group_by(Tipo) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(Tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("Tipo", "Total number", "Percentage")
type_pie
## # A tibble: 3 x 3
##   Tipo        `Total number` Percentage
##   <chr>                <int> <chr>     
## 1 Relacionado           8570 12%       
## 2 Importado              687 1%        
## 3 En Estudio           60406 87%

Excluding the cases still under investigation (En Estudio), the majority of the classified cases are people who got infected inside Colombia. The people who contracted the disease abroad came mainly from Europe, followed by North America and Central America.

Correlation between the categorical variables

We used Cramér’s V to measure the association between our categorical variables.
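Cramér’s V for two categorical variables can be computed from the chi-squared statistic of their contingency table (a minimal sketch; the function name is ours, and packages such as vcd provide equivalent helpers):

```r
# Cramér's V: sqrt(chi^2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  unname(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}
```

For example, `cramers_v(colombia_covid$Sexo, colombia_covid$Tipo)` gives the association between sex and type of infection.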


The frequentist approach

Train/test split

We split the data so as to leave out the last three points for prediction, both because we have few points and because with these models it makes little sense to predict a whole week ahead: the situation changes very fast.

Poisson

Poisson with Elapsed time as predictor

poisson1 <- glm(`Cumulative cases` ~ `Elapsed time`,
                data=data1[1:120, ], family=poisson)
plot(poisson1, which=1)

pred.pois1 <- poisson1$fitted.values
res.st1 <- (data1$`Cumulative cases`[1:120] - pred.pois1)/sqrt(pred.pois1)
#n=120, k=2, n-k=118
print(paste("Estimated overdispersion", sum(res.st1^2)/118))
poisson1.pred <- predict(poisson1, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson1$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1$null.deviance, deviance(poisson1)), 2))
## [1] "Estimated overdispersion 126.060786858437"
## [1] "RMSE: 1760.50922940986"
## [1] "AIC: 21915.0291925917"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 20716.39"
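The overdispersion estimate printed above is the Pearson statistic divided by the residual degrees of freedom,

\[\hat{\phi} \;=\; \frac{1}{n-k}\sum_{i=1}^{n}\frac{\left(y_i-\hat{\mu}_i\right)^2}{\hat{\mu}_i},\]

with \(n=120\) observations and \(k=2\) estimated coefficients here; values much larger than \(1\) indicate that the Poisson assumption \(\operatorname{Var}(Y_i)=\mathbb{E}(Y_i)\) does not hold.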

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.158730158730159"
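The coverage frequency is presumably computed along these lines (a sketch; the function name is ours, assuming a central \(95\%\) Poisson prediction interval built with qpois at each predicted mean):

```r
# Share of observed counts falling inside the central 95% Poisson
# prediction interval built at each predicted mean
coverage_freq <- function(y, mu, level = 0.95) {
  alpha <- (1 - level) / 2
  lower <- qpois(alpha, lambda = mu)
  upper <- qpois(1 - alpha, lambda = mu)
  mean(y >= lower & y <= upper)
}
```

Used, for instance, as `coverage_freq(data1$`Cumulative cases`, predict(poisson1, newdata = data1, type = "response"))`.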

Prediction interval for test set


Poisson with Elapsed time as predictor for New cases/day

poisson1bis <- glm(`New cases/day` ~ `Elapsed time`,
                   data=data1[1:120, ], family=poisson)
plot(poisson1bis, which=1)

pred.pois1bis <- poisson1bis$fitted.values
res.st1bis <- (data1$`New cases/day`[1:120] - pred.pois1bis)/sqrt(pred.pois1bis)
#n=120, k=2, n-k=118
print(paste("Estimated overdispersion", sum(res.st1bis^2)/118))
poisson1bis.pred <- predict(poisson1bis, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1bis.pred - data1$`New cases/day`[120:126])^2)))
paste("AIC:", poisson1bis$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1bis$null.deviance, deviance(poisson1bis)), 2))
## [1] "AIC: 7291.77457395168"
## [1] "Null deviance:  64369.08"   "Residual deviance: 6442.53"

Predictive accuracy of the Poisson model for New cases/day

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.23015873015873"

Prediction interval for test set


Poisson with Elapsed time plus Elapsed time^2 as predictor

poisson1.5 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2),
                  data=cases[1:120, ], family=poisson)
plot(poisson1.5, which=1)

pred.pois1.5 <- poisson1.5$fitted.values
res.st1.5 <- (cases$`Cumulative cases`[1:120] - pred.pois1.5)/sqrt(pred.pois1.5)
#n=120, k=3, n-k=117
print(paste("Estimated overdispersion", sum(res.st1.5^2)/117))
poisson1.5.pred <- predict(poisson1.5, newdata = cases[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.5.pred - cases$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson1.5$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1.5$null.deviance, deviance(poisson1.5)), 2))
## [1] "Estimated overdispersion 75.2646213768741"
## [1] "RMSE: 6940.37335779742"
## [1] "AIC: 12062.7254007421"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 10862.08"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.134920634920635"

Prediction interval for test set


Poisson with Elapsed time plus Sexo

poisson2 <- glm(`Cumulative cases` ~ `Elapsed time` + Sexo_M,
                data=data1[1:120, ], family=poisson)
plot(poisson2, which=1)

poisson2.pred <- predict(poisson2, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson2.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson2$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson2$null.deviance, deviance(poisson2)), 2))
## [1] "RMSE: 3892.80826581986"
## [1] "AIC: 20887.674847589"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 19687.03"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.166666666666667"

Prediction interval for test set


Poisson with Elapsed time plus Group de edad

poisson3 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
                  `Grupo de edad_31_45` + `Grupo de edad_46_60` +
                  `Grupo de edad_60_75` + `Grupo de edad_76+`,
                data=data1[1:120, ], family=poisson)
plot(poisson3, which=1)

pred.pois3 <- poisson3$fitted.values
res.st3 <- (data1$`Cumulative cases`[1:120] - pred.pois3)/sqrt(pred.pois3)
#n=120, k=7, n-k=113
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st3^2)/113))
poisson3.pred <- predict(poisson3, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson3.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson3$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson3$null.deviance, deviance(poisson3)), 2))
## [1] "RMSE: 3249.31716470375"
## [1] "AIC: 20650.0253613547"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 19441.38"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.126984126984127"

Prediction interval for test set


Poisson with Elapsed time plus Department

poisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
                  `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
                  `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
                  `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
                  `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
                  `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
                data=data1[1:120, ], family=poisson)
plot(poisson4, which=1)

pred.pois4 <- poisson4$fitted.values
res.st4 <- (data1$`Cumulative cases`[1:120] - pred.pois4)/sqrt(pred.pois4)
#n=120, k=13, n-k=107
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4^2)/107))
poisson4.pred <- predict(poisson4, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson4.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson4$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson4$null.deviance, deviance(poisson4)), 2))
## [1] "RMSE: 3328.87135693358"
## [1] "AIC: 19275.3910967978"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 18054.75"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0793650793650794"

Prediction interval for test set


Poisson with Elapsed time, Age and Departments as predictors

poisson5 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
                  `Grupo de edad_46_60` + `Grupo de edad_60_75` +`Grupo de edad_76+` +
                  `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
                  `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
                  `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
                  `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
                  `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
                  `Departamento o Distrito_Tolima`, data=data1[1:120, ], family=poisson)
plot(poisson5, which=1)

pred.pois5 <- poisson5$fitted.values
res.st5 <- (data1$`Cumulative cases`[1:120] - pred.pois5)/sqrt(pred.pois5)
#n=120, k=18, n-k=102
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st5^2)/102))
poisson5.pred <- predict(poisson5, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson5.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson5$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson5$null.deviance, deviance(poisson5)), 2))
## [1] "RMSE: 8243.70282563776"
## [1] "AIC: 18085.2709276108"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 16854.63"

Predictive accuracy of the Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.119047619047619"

Prediction interval for test set


Poisson with Elapsed time, Age and Departments as predictors for New cases/day

poisson5bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` +
                     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
                     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
                     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
                     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
                     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
                     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
                     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
                   data=data1[1:120, ], family=poisson)
plot(poisson5bis, which=1)

pred.pois5bis <- poisson5bis$fitted.values
res.st5bis <- (data1$`New cases/day`[1:120] - pred.pois5bis)/sqrt(pred.pois5bis)
#n=120, k=18, n-k=102
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st5bis^2)/102))
poisson5bis.pred <- predict(poisson5bis, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson5bis.pred - data1$`New cases/day`[120:126])^2)))
paste("AIC:", poisson5bis$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson5bis$null.deviance, deviance(poisson5bis)), 2))
## [1] "RMSE: 1225.26041309972"
## [1] "AIC: 2860.38819023372"
## [1] "Null deviance:  64369.08"   "Residual deviance: 1979.14"

Predictive accuracy of the Poisson model for New cases/day

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.436507936507937"

Prediction interval for test set


Poisson with Elapsed time, Elapsed time^2, Age and Departments as predictors

poisson6 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2) +
                  `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` +
                  `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
                  `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
                  `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
                  `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
                  `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
                  `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
                data=data1[1:120, ], family=poisson)
plot(poisson6, which=1)

pred.pois6 <- poisson6$fitted.values
res.st6 <- (data1$`Cumulative cases`[1:120] - pred.pois6)/sqrt(pred.pois6)
#n=120, k=19, n-k=101
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st6^2)/101))
poisson6.pred <- predict(poisson6, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson6.pred - data1$`Cumulative cases`[120:126])^2)))
paste("AIC:", poisson6$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson6$null.deviance, deviance(poisson6)), 2))
## [1] "RMSE: 8441.28307878747"
## [1] "AIC: 8781.84768119021"
## [1] "Null deviance:  1740689.09" "Residual deviance: 7549.2"

Predictive accuracy of the Poisson model for Cumulative cases

## [1] "Frequency of coverage: 0.0416666666666667"

Prediction interval for test set

As before, there is something strange in the plotting function when the model overestimates the predictions.

Summary table - Poisson models for Cumulative cases

Model                                                                         AIC   RMSE
Cumulative cases ~ Elapsed time                                             21915   1760
Cumulative cases ~ Elapsed time + Elapsed time^2                            12062   6940
Cumulative cases ~ Elapsed time + Sex                                       20887   3892
Cumulative cases ~ Elapsed time + Age group                                 20650   3249
Cumulative cases ~ Elapsed time + Department                                19275   3328
Cumulative cases ~ Elapsed time + Department + Age group                    18085   8243
Cumulative cases ~ Elapsed time + Elapsed time^2 + Department + Age group    8781   8441

Summary table - Poisson models for New cases/day

Model                                                   AIC    RMSE
New cases/day ~ Elapsed time                           8427   58642
New cases/day ~ Elapsed time + Age group + Department  2860    1225

Autocorrelation to compare Poisson models

We generated 1000 samples from each of the four Poisson models, calculated their autocorrelations, and compared them against the autocorrelation of our original sample.
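The simulation can be sketched as follows (a sketch; the function name is ours, shown for a model's fitted means, with replicate series drawn from the fitted Poisson distribution):

```r
# For a fitted Poisson model, draw n.sims replicate series at the fitted
# means and return their autocorrelations (lag 1 onwards), one column per draw
sim_poisson_acf <- function(mu, n.sims = 1000, lag.max = 10) {
  replicate(n.sims, {
    y.rep <- rpois(length(mu), lambda = mu)
    acf(y.rep, lag.max = lag.max, plot = FALSE)$acf[-1]  # drop lag 0
  })
}
```

For example, `sim_poisson_acf(poisson1$fitted.values)` gives the simulated autocorrelations to compare against `acf(data1$`Cumulative cases`[1:120], plot = FALSE)`.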

Autocorrelation for the models when the response variable is Cumulative Cases


Autocorrelation for the models when the response variable is New cases/day


ANOVA to compare the Poisson models

anova(poisson1, poisson2, poisson3, poisson4, poisson5, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: `Cumulative cases` ~ `Elapsed time`
## Model 2: `Cumulative cases` ~ `Elapsed time` + Sexo_M
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+`
## Model 4: `Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## Model 5: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       118      20716                          
## 2       117      19687  1  1029.35 < 2.2e-16 ***
## 3       113      19441  4   245.65 < 2.2e-16 ***
## 4       107      18055  6  1386.63 < 2.2e-16 ***
## 5       102      16855  5  1200.12 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quasi-Poisson

Quasi Poisson with Elapsed time as predictor

poisson1quasi <- glm(`Cumulative cases` ~ `Elapsed time`,
                     data=data1[1:120, ], family=quasipoisson)
plot(poisson1quasi, which=1)

pred.poisq <- poisson1quasi$fitted.values
res.stq <- (data1$`Cumulative cases`[1:120] - pred.poisq)/sqrt(summary(poisson1quasi)$dispersion*pred.poisq)
#n=120, k=2, n-k=118
print(paste("Estimated overdispersion", sum(res.stq^2)/118))
poisson1quasi.pred <- predict(poisson1quasi, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((poisson1quasi.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(poisson1quasi$residuals^2))
paste("AIC:", poisson1quasi$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1quasi$null.deviance, deviance(poisson1quasi)), 2))
## [1] "Estimated overdispersion 15728.0601174826"
## [1] "RMSE: 1760.50922940986"
## [1] "AIC: NA"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 20716.39"

Predictive accuracy of the Quasi-Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.849206349206349"

Quasi Poisson with Elapsed time as predictors for New cases/day

poisson1quasibis <- glm(`New cases/day` ~ `Elapsed time`,
                        data=data1[1:120, ], family=quasipoisson)
plot(poisson1quasibis, which=1)

pred.pois1quasibis <- poisson1quasibis$fitted.values
res.st1quasibis <- (data1$`New cases/day`[1:120] - pred.pois1quasibis)/sqrt(pred.pois1quasibis)
#n=120, k=2, n-k=118
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st1quasibis^2)/118))
poisson1quasibis.pred <- predict(poisson1quasibis, newdata = data1[120:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson1quasibis.pred - data1$`New cases/day`[120:126])^2)))
#paste("MSE:", mean(poisson1quasibis$residuals^2))
paste("AIC:", poisson1quasibis$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1quasibis$null.deviance, deviance(poisson1quasibis)), 2))
## [1] "Estimated overdispersion 8958055.49491902"
## [1] "RMSE: 753.966239397641"
## [1] "AIC: NA"
## [1] "Null deviance:  64369.08"   "Residual deviance: 6442.53"

Predictive accuracy of the Quasi Poisson model for New cases/day

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.183333333333333"

Prediction interval for test set


The prediction interval here is quite poor; it is arguably unnecessary anyway, since the previous plot already shows the fit well.

Quasi-Poisson with Elapsed time and Age as predictors

poisson2quasi <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
                       `Grupo de edad_31_45` + `Grupo de edad_46_60` +
                       `Grupo de edad_60_75` + `Grupo de edad_76+`,
                     data=data1[1:120, ], family=quasipoisson)
plot(poisson2quasi, which=1)

pred.poisq2 <- poisson2quasi$fitted.values
res.stq2 <- (data1$`Cumulative cases` - pred.poisq2)/sqrt(summary(poisson2quasi)$dispersion*pred.poisq2)
#n=120, k=7, n-k=113
print(paste("Estimated overdispersion", sum(res.stq2^2)/113))
poisson2quasi.pred <- predict(poisson2quasi, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((poisson2quasi.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(poisson2quasi$residuals^2))
paste("AIC:", poisson2quasi$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson2quasi$null.deviance, deviance(poisson2quasi)), 2))
## [1] "Estimated overdispersion 21853.7140901657"
## [1] "RMSE: 3249.31716470375"
## [1] "AIC: NA"
## [1] "Null deviance:  1740689.09"  "Residual deviance: 19441.38"

Predictive accuracy of the Quasi-Poisson model for Cumulative cases

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.873015873015873"

Prediction interval for test set

Much like the previous one; again, this plot is probably unnecessary.

Summary table - Quasi-Poisson models for Cumulative cases

Model                                          RMSE
Cumulative cases ~ Elapsed time                1760
Cumulative cases ~ Elapsed time + Age group    3249

Summary table - Quasi-Poisson models for New cases/day

Model                                          RMSE
New cases/day ~ Elapsed time                    753

Negative Binomial

Negative Binomial with Elapsed time as predictor

nb1 <- glm.nb(`Cumulative cases` ~ `Elapsed time`,
              data=data1[1:120, ])
plot(nb1, which=1)

#n=120, k=2, n-k=118
stdres <- rstandard(nb1)
print(paste("Estimated overdispersion", sum(stdres^2)/118))
nb1.pred <- predict(nb1, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb1.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(nb1$residuals^2))
paste("AIC:", nb1$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb1$null.deviance, deviance(nb1)), 2))
## [1] "Estimated overdispersion 177.301055415717"
## [1] "RMSE: 1765.66109007326"
## [1] "AIC: 21911.1770465374"
## [1] "Null deviance:  1738722.15"  "Residual deviance: 20710.41"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.158730158730159"

Prediction interval for test set


Negative Binomial with Elapsed time plus Age as predictors

nb2 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30`+
                `Grupo de edad_31_45` + `Grupo de edad_46_60` +
                `Grupo de edad_60_75` + `Grupo de edad_76+`,
              data=data1[1:120, ])
plot(nb2, which=1)

nb2.pred <- predict(nb2, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb2.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(nb2$residuals^2))
paste("AIC:", nb2$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb2$null.deviance, deviance(nb2)), 2))
## [1] "RMSE: 3253.44941468989"
## [1] "AIC: 20645.2695912123"
## [1] "Null deviance:  1738975.56"  "Residual deviance: 19434.52"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.126984126984127"

Prediction interval for test set


Negative Binomial with Elapsed time plus Department as predictors

nb3 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` +
                `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas`+
                `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
                `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
                `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
                `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`,
              data=data1[1:120, ])
plot(nb3, which=1)

nb3.pred <- predict(nb3, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb3.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(nb3$residuals^2))
paste("AIC:", nb3$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb3$null.deviance, deviance(nb3)), 2))
## [1] "RMSE: 3335.37126416692"
## [1] "AIC: 19270.2478165627"
## [1] "Null deviance:  1739087.22" "Residual deviance: 18047.5"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.0793650793650794"

Prediction interval for test set


Negative Binomial with Elapsed time plus Continent of origin as predictors

# nbNO <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia` +
#                  `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
#                  `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
#                  `Continente de procedencia_Sudamerica`, data=data1[1:22, ])
# plot(nbNO, which=1)
# nbNO.pred <- predict(nbNO, newdata = data1[23:25, ], type = "response")
# paste("RMSE:", sqrt(mean((nbNO.pred - data1$`Cumulative cases`[23:25])^2)))
# #paste("MSE:", mean(nbNO$residuals^2))
# paste("AIC:", nbNO$aic)
# paste(c("Null deviance: ", "Residual deviance:"),
#        round(c(nbNO$null.deviance, deviance(nbNO)), 2))

Negative Binomial with Elapsed time, Age and Department as predictors

nb4 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
                `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` +
                `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
                `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
                `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
                `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
                `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
                `Departamento o Distrito_Tolima`, data=data1[1:120, ])
plot(nb4, which=1)

# Calculating overdispersion n=120 k=19 n-k=101
stdres <- rstandard(nb4)
print(paste("Estimated overdispersion", sum(stdres^2)/101))
nb4.pred <- predict(nb4, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb4.pred - data1$`Cumulative cases`[120:126])^2)))
#paste("MSE:", mean(nb4$residuals^2))
paste("AIC:", nb4$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb4$null.deviance, deviance(nb4)), 2))
## [1] "Estimated overdispersion 187.46366613162"
## [1] "RMSE: 8253.89918410353"
## [1] "AIC: 18080.4893960953"
## [1] "Null deviance:  1739076.35"  "Residual deviance: 16847.74"

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.119047619047619"

Prediction interval for test set


Negative Binomial with Elapsed time, Age and Department as predictors for New cases/day

nb4bis <- glm.nb(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` +
                   `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` +
                   `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
                   `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
                   `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
                   `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` +
                   `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` +
                   `Departamento o Distrito_Tolima`, data=data1[1:120, ])
plot(nb4bis, which=1)

# Calculating overdispersion n=120 k=19 n-k=101
stdres2 <- rstandard(nb4bis)
print(paste("Estimated overdispersion", sum(stdres2^2)/101))
nb4bis.pred <- predict(nb4bis, newdata = data1[120:126, ], type = "response")
paste("RMSE:", sqrt(mean((nb4bis.pred - data1$`New cases/day`[120:126])^2)))
#paste("MSE:", mean(nb4bis$residuals^2))
paste("AIC:", nb4bis$aic)
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb4bis$null.deviance, deviance(nb4bis)), 2))
## [1] "Estimated overdispersion 1.6253849069091"
## [1] "RMSE: 1379.56664915406"
## [1] "AIC: 1435.14403991019"
## [1] "Null deviance:  1416.29"   "Residual deviance: 146.46"

Predictive accuracy of the NB model for New cases/day

Predicting with a \(95\%\) confidence interval

## [1] "Frequency of coverage: 0.936507936507937"

Prediction interval for test set


The prediction interval for the test set is, in one word, horrible.

Summary table - Negative Binomial models for Cumulative cases

Model                                                      AIC   RMSE
Cumulative cases ~ Elapsed time                          21911   1765
Cumulative cases ~ Elapsed time + Age group              20645   3253
Cumulative cases ~ Elapsed time + Department             19270   3335
Cumulative cases ~ Elapsed time + Department + Age group 18080   8253

Summary table - Negative Binomial models for New cases/day

Model                                                      AIC   RMSE
New cases/day ~ Elapsed time + Age group + Department     1435   1379
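Summary tables like these can also be assembled programmatically from the fitted models. A sketch on simulated data — the model names and the `rmse()` helper are illustrative, not from the original analysis:

```r
library(MASS)  # glm.nb

set.seed(4)
n <- 126
d <- data.frame(x = 1:n, g = factor(rep(c("A", "B"), length.out = n)))
d$y <- rnbinom(n, mu = exp(1 + 0.03 * d$x + 0.3 * (d$g == "B")), size = 5)

# Fit candidate models on the first 120 rows, keep the rest for testing
models <- list(
  "y ~ x"     = glm.nb(y ~ x, data = d[1:120, ]),
  "y ~ x + g" = glm.nb(y ~ x + g, data = d[1:120, ])
)

# Test-set RMSE, computed as in the report's chunks
rmse <- function(m) {
  p <- predict(m, newdata = d[121:126, ], type = "response")
  sqrt(mean((p - d$y[121:126])^2))
}

tab <- data.frame(Model = names(models),
                  AIC   = sapply(models, AIC),
                  RMSE  = sapply(models, rmse),
                  row.names = NULL)
tab
```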

Autocorrelation to compare Negative Binomial models

We generated 1000 samples from each of the four Negative Binomial models, computed the autocorrelation of each simulated sample, and compared it with the autocorrelation of the original sample.

Autocorrelation for the models when the response variable is Cumulative Cases


Autocorrelation for the model when the response variable is New cases/day
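The simulation just described is not shown as code. A minimal stand-in on simulated data — the real version would use the project's fitted models and the full ACF rather than just lag 1:

```r
library(MASS)  # glm.nb

set.seed(5)
n <- 120
x <- 1:n
y <- rnbinom(n, mu = exp(1 + 0.03 * x), size = 5)
fit <- glm.nb(y ~ x)

# Lag-1 autocorrelation of a series
acf1 <- function(z) acf(z, lag.max = 1, plot = FALSE)$acf[2]

# 1000 replicate samples drawn from the fitted model
sim.acf1 <- replicate(1000, acf1(rnbinom(n, mu = fitted(fit), size = fit$theta)))
obs.acf1 <- acf1(y)

# Position of the observed autocorrelation among the simulated ones
mean(sim.acf1 >= obs.acf1)
```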


Applying ANOVA to compare the negative binomial models

#Applying ANOVA to compare the negative binomial models
anova(nb1, nb2, nb3, nb4)
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Cumulative cases
##   Model
## 1 Elapsed time
## 2 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
## 4 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`
##      theta Resid. df    2 x log-lik.   Test    df LR stat. Pr(Chi)
## 1 11252779       118       -21905.18                              
## 2 12919544       113       -20629.27 1 vs 2     5 1275.907       0
## 3 13821346       107       -19242.25 2 vs 3     6 1387.022       0
## 4 13728065       102       -18042.49 3 vs 4     5 1199.758       0
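Two remarks on this table. First, the estimated theta values are enormous (order 10^7); since the negative binomial tends to the Poisson as theta grows, these fits are effectively Poisson. Second, `anova()` on nested `glm.nb` fits performs likelihood ratio tests, with the LR statistic equal to the difference of the `2 x log-lik.` entries. A sketch on simulated data:

```r
library(MASS)  # glm.nb

set.seed(6)
n <- 200
d <- data.frame(x = 1:n, g = factor(rep(c("A", "B"), length.out = n)))
d$y <- rnbinom(n, mu = exp(1 + 0.02 * d$x + 0.5 * (d$g == "B")), size = 5)

m1 <- glm.nb(y ~ x, data = d)      # smaller model
m2 <- glm.nb(y ~ x + g, data = d)  # nested extension

# Likelihood ratio test between the nested fits
lrt <- anova(m1, m2)
lrt
```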

GAM

library(mgcv)
library(purrr)
library(stringr)
# Clean up column names (spaces, slashes, "+") so they can be used
# directly in gam() formulas
df <- data1 %>%
  set_names(~ str_replace_all(., " ", "_")) %>%
  set_names(~ str_replace_all(., "/", "_"))
names(df)[names(df) == "Grupo_de_edad_76+"] <- "Grupo_de_edad_76"

GAM with Elapsed time as covariate for Cumulative cases

#fit, predict, intervals
gam1 <- gam(Cumulative_cases~s(Elapsed_time), family = poisson, data = df[1:120,])
gam1.pred<-predict(gam1, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam1.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 1282.65627040657"
paste("AIC:", AIC(gam1))
## [1] "AIC: 1587.11794466718"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam1.train <- predict(gam1, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam1 <- gam1$fitted.values - 1.96 * se.gam1.train
train.upr.gam1 <- gam1$fitted.values + 1.96 * se.gam1.train

se.gam1.test <- predict(gam1, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam1 <- gam1.pred - 1.96 * se.gam1.test
test.upr.gam1 <- gam1.pred + 1.96 * se.gam1.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam1, ymax = train.upr.gam1),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam1, ymax = test.upr.gam1),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam1$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam1.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))
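The ribbons above come from fitted values ± 1.96 standard errors. Note that `predict.gam()` returns link-scale (log) standard errors by default, so a robust alternative — sketched here on simulated data, not the project's `df` — is to build the interval on the link scale and exponentiate, which also keeps the bounds positive:

```r
# Sketch (simulated data): a 95% CI for a Poisson GAM, built on the link
# (log) scale and then exponentiated so the bounds are always positive.
library(mgcv)

set.seed(3)
n <- 120
x <- seq(0, 1, length.out = n)
y <- rpois(n, lambda = exp(1 + 2 * x))

m <- gam(y ~ s(x), family = poisson)
pr <- predict(m, se.fit = TRUE)  # link scale by default

lwr <- exp(pr$fit - 1.96 * pr$se.fit)
upr <- exp(pr$fit + 1.96 * pr$se.fit)
```

Exponentiating the link-scale bounds gives an interval on the case-count scale that can never dip below zero.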

GAM with Elapsed time as covariate for New cases/day

#fit, predict, intervals
gam1B <- gam(New_cases_day~s(Elapsed_time), family = poisson, data = df[1:120,])
gam1B.pred<-predict(gam1B, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam1B.pred - df$New_cases_day[120:126])^2)))
## [1] "RMSE: 1121.41502755924"
paste("AIC:", AIC(gam1B))
## [1] "AIC: 6361.02810119354"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam1B.train <- predict(gam1B, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam1B <- gam1B$fitted.values - 1.96 * se.gam1B.train
train.upr.gam1B <- gam1B$fitted.values + 1.96 * se.gam1B.train

se.gam1B.test <- predict(gam1B, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam1B <- gam1B.pred - 1.96 * se.gam1B.test
test.upr.gam1B <- gam1B.pred + 1.96 * se.gam1B.test

ggplot(df, aes(Elapsed_time, New_cases_day)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam1B, ymax = train.upr.gam1B),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam1B, ymax = test.upr.gam1B),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam1B$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam1B.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = New_cases_day)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="New cases/day ~ Elapsed time") +
  xlab("Days") + ylab("New cases/day") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time and Elapsed time^2 as covariates for Cumulative cases

#fit, predict, intervals
gam1C <- gam(Cumulative_cases~s(Elapsed_time)+s(I(Elapsed_time^2)), family = poisson, data = df[1:120,])
gam1C.pred<-predict(gam1C, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam1C.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 7633.40056525472"
paste("AIC:", AIC(gam1C))
## [1] "AIC: 1373.12905571594"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam1C.train <- predict(gam1C, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam1C <- gam1C$fitted.values - 1.96 * se.gam1C.train
train.upr.gam1C <- gam1C$fitted.values + 1.96 * se.gam1C.train

se.gam1C.test <- predict(gam1C, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam1C <- gam1C.pred - 1.96 * se.gam1C.test
test.upr.gam1C <- gam1C.pred + 1.96 * se.gam1C.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam1C, ymax = train.upr.gam1C),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam1C, ymax = test.upr.gam1C),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam1C$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam1C.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Elapsed time^2") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time and Sex as covariates for Cumulative cases

#fit, predict, intervals
gam2 <- gam(Cumulative_cases~s(Elapsed_time)+s(Sexo_M), family = poisson(), data = df[1:120,])
gam2.pred<-predict(gam2, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam2.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 2301.65489717445"
paste("AIC:", AIC(gam2))
## [1] "AIC: 1525.13432986912"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam2.train <- predict(gam2, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam2 <- gam2$fitted.values - 1.96 * se.gam2.train
train.upr.gam2 <- gam2$fitted.values + 1.96 * se.gam2.train

se.gam2.test <- predict(gam2, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam2 <- gam2.pred - 1.96 * se.gam2.test
test.upr.gam2 <- gam2.pred + 1.96 * se.gam2.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam2, ymax = train.upr.gam2),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam2, ymax = test.upr.gam2),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam2$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam2.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Sex") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time, Sex and Age as covariates for Cumulative cases

#fit, predict, intervals
gam3 <- gam(Cumulative_cases ~ s(Elapsed_time) + s(Sexo_M) +
              Grupo_de_edad_19_30 + Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
              Grupo_de_edad_60_75 + Grupo_de_edad_76,
            family = poisson(), data = df[1:120, ])
gam3.pred<-predict(gam3, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam3.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 3874.19862207675"
paste("AIC:", AIC(gam3))
## [1] "AIC: 1519.73391196235"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam3.train <- predict(gam3, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam3 <- gam3$fitted.values - 1.96 * se.gam3.train
train.upr.gam3 <- gam3$fitted.values + 1.96 * se.gam3.train

se.gam3.test <- predict(gam3, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam3 <- gam3.pred - 1.96 * se.gam3.test
test.upr.gam3 <- gam3.pred + 1.96 * se.gam3.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam3, ymax = train.upr.gam3),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam3, ymax = test.upr.gam3),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam3$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam3.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Sex + Age group") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time, Sex and Department as covariates for Cumulative cases

#fit, predict, intervals
gam4 <- gam(Cumulative_cases ~ s(Elapsed_time) + s(Sexo_M) +
              Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
              Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
              Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
              Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
              Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
              Departamento_o_Distrito_Tolima,
            family = poisson(), data = df[1:120, ])
gam4.pred<-predict(gam4, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam4.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 1808.4031672862"
paste("AIC:", AIC(gam4))
## [1] "AIC: 1486.02139832832"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam4.train <- predict(gam4, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam4 <- gam4$fitted.values - 1.96 * se.gam4.train
train.upr.gam4 <- gam4$fitted.values + 1.96 * se.gam4.train

se.gam4.test <- predict(gam4, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam4 <- gam4.pred - 1.96 * se.gam4.test
test.upr.gam4 <- gam4.pred + 1.96 * se.gam4.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam4, ymax = train.upr.gam4),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam4, ymax = test.upr.gam4),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam4$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam4.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Sex + Department") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time, Department and Age as covariates for Cumulative cases

#fit, predict, intervals
gam5 <- gam(Cumulative_cases ~ s(Elapsed_time) +
              Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
              Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
              Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
              Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
              Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
              Departamento_o_Distrito_Tolima +
              Grupo_de_edad_19_30 + Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
              Grupo_de_edad_60_75 + Grupo_de_edad_76,
            family = poisson(), data = df[1:120, ])
gam5.pred<-predict(gam5, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam5.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 1038.67068102747"
paste("AIC:", AIC(gam5))
## [1] "AIC: 1486.7078613768"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam5.train <- predict(gam5, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam5 <- gam5$fitted.values - 1.96 * se.gam5.train
train.upr.gam5 <- gam5$fitted.values + 1.96 * se.gam5.train

se.gam5.test <- predict(gam5, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam5 <- gam5.pred - 1.96 * se.gam5.test
test.upr.gam5 <- gam5.pred + 1.96 * se.gam5.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam5, ymax = train.upr.gam5),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam5, ymax = test.upr.gam5),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam5$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam5.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Department + Age group") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time, Sex, Department and Age as covariates for New cases/day

#fit, predict, intervals
gam5bis <- gam(New_cases_day ~ s(Elapsed_time) + s(Sexo_M) +
                 Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
                 Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
                 Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
                 Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
                 Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
                 Departamento_o_Distrito_Tolima +
                 Grupo_de_edad_19_30 + Grupo_de_edad_31_45 + Grupo_de_edad_46_60 +
                 Grupo_de_edad_60_75 + Grupo_de_edad_76,
               family = poisson(), data = df[1:120, ])
gam5bis.pred<-predict(gam5bis, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam5bis.pred - df$New_cases_day[120:126])^2)))
## [1] "RMSE: 868.400823218105"
paste("AIC:", AIC(gam5bis))
## [1] "AIC: 1078.02668310693"
#intervals (SEs requested on the response scale to match the fitted values)
se.gam5bis.train <- predict(gam5bis, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam5bis <- gam5bis$fitted.values - 1.96 * se.gam5bis.train
train.upr.gam5bis <- gam5bis$fitted.values + 1.96 * se.gam5bis.train

se.gam5bis.test <- predict(gam5bis, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam5bis <- gam5bis.pred - 1.96 * se.gam5bis.test
test.upr.gam5bis <- gam5bis.pred + 1.96 * se.gam5bis.test

ggplot(df, aes(Elapsed_time, New_cases_day)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam5bis, ymax = train.upr.gam5bis),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam5bis, ymax = test.upr.gam5bis),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam5bis$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam5bis.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = New_cases_day)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="New cases/day ~ Elapsed time + Sex + Department + Age group") +
  xlab("Days") + ylab("New cases/day") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

GAM with Elapsed time, Elapsed time^2, Department, and Age group as covariates for Cumulative cases

#fit, predict, intervals
gam6 <- gam(Cumulative_cases ~ s(Elapsed_time) + s(I(Elapsed_time^2)) +
    Departamento_o_Distrito_Bogotá_D.C. + Departamento_o_Distrito_Boyacá +
    Departamento_o_Distrito_Caldas + Departamento_o_Distrito_Casanare +
    Departamento_o_Distrito_Cauca + Departamento_o_Distrito_Cundinamarca +
    Departamento_o_Distrito_Meta + Departamento_o_Distrito_Quindío +
    Departamento_o_Distrito_Risaralda + Departamento_o_Distrito_Santander +
    Departamento_o_Distrito_Tolima + Grupo_de_edad_19_30 + Grupo_de_edad_31_45 +
    Grupo_de_edad_46_60 + Grupo_de_edad_60_75 + Grupo_de_edad_76,
    family = poisson(), data = df[1:120, ])
gam6.pred<-predict(gam6, newdata=df[120:126,], type="response")
paste("RMSE:", sqrt(mean((gam6.pred - df$Cumulative_cases[120:126])^2)))
## [1] "RMSE: 4599.76488921526"
paste("AIC:", AIC(gam6))
## [1] "AIC: 1337.16765230608"
#intervals
# standard errors on the response scale, to match the fitted values
se.gam6.train <- predict(gam6, newdata = df[1:120, ], type = "response", se.fit = TRUE)$se.fit
train.lwr.gam6<-gam6$fitted.values - 1.96 * se.gam6.train
train.upr.gam6<-gam6$fitted.values + 1.96 * se.gam6.train

se.gam6.test <- predict(gam6, newdata = df[120:126, ], type = "response", se.fit = TRUE)$se.fit
test.lwr.gam6<-gam6.pred - 1.96*se.gam6.test
test.upr.gam6<-gam6.pred + 1.96*se.gam6.test

ggplot(df, aes(Elapsed_time, Cumulative_cases)) +
  geom_ribbon(aes(x = Elapsed_time, ymin = train.lwr.gam6, ymax = train.upr.gam6),
              data = df[1:120, ],
              fill = color_scheme_get("blue")[[2]]) + 
  geom_ribbon(aes(x = Elapsed_time, ymin = test.lwr.gam6, ymax = test.upr.gam6),
              data = df[120:126, ],
              fill = color_scheme_get("red")[[2]]) +
  geom_line(aes(x = Elapsed_time, y = gam6$fitted.values),
              data = df[1:120, ],
              color = color_scheme_get("blue")[[4]], 
              size = 1.1) +
  geom_line(aes(x= Elapsed_time, y = gam6.pred),
              data = df[120:126, ],
              color = color_scheme_get("red")[[4]],
              size = 1.1) +
  geom_point(aes(x = Elapsed_time, y = Cumulative_cases)) +
  expand_limits(x = 28) +
  ggtitle("GAM Model", subtitle="Cumulative cases ~ Elapsed time + Elapsed time^2 + Department + Age group") +
  xlab("Days") + ylab("Total cases") +
  scale_x_discrete(limit = c(0, 21, 42, 63, 84, 105, 122, 128),
                labels = c("06-3", "29-3", "19-4", "10-5", "31-5", "21-6", "06-7", "12-7")) +
  # facet_wrap(~"Department", scales ='free') +
  theme(strip.text.x = element_text(size = 12, colour = "black"),
        axis.text.x = element_text(face = "bold",
        color = "#993333", angle = 45, size = 9),
        plot.title = element_text(size = 22),
        axis.title.x = element_text(size = 18), 
        axis.title.y = element_text(size = 18))

Summary table - GAM models for Cumulative cases

Model                                                                        AIC  RMSE
Cumulative cases ~ Elapsed time                                             1587  1282
Cumulative cases ~ Elapsed time + Elapsed time^2                            1373  7633
Cumulative cases ~ Elapsed time + Sex                                       1525  2301
Cumulative cases ~ Elapsed time + Age group                                 1519  3874
Cumulative cases ~ Elapsed time + Department                                1486  1808
Cumulative cases ~ Elapsed time + Department + Age group                    1486  1038
Cumulative cases ~ Elapsed time + Elapsed time^2 + Department + Age group   1337  4599

Summary table - GAM models for New cases/day

Model                                                   AIC  RMSE
New cases/day ~ Elapsed time                           6361  1121
New cases/day ~ Elapsed time + Age group + Department  1078   868

The Bayesian approach

Poisson regression for response variable Cumulative cases/Department

As a first attempt, we fit a simple Poisson regression: \[ \ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathrm{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \] with \(i = 1,\dots,134\), where \(134\) is the number of rows of our dataset and \(y_i\) is the number of cases.

As for the Stan program, we used the function poisson_log_rng to generate the distribution of \(y_i\), namely the number of cases each day, and the function poisson_log_lpmf to specify the likelihood.
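A minimal Stan program consistent with this description might look as follows. This is a hedged sketch under our own naming assumptions (N, elapsed_time, y), not the authors' actual file:

```stan
data {
  int<lower=1> N;              // number of observations
  vector[N] elapsed_time;
  int<lower=0> y[N];           // case counts
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0.25, 1);
  // likelihood on the log scale, via poisson_log_lpmf
  target += poisson_log_lpmf(y | alpha + beta * elapsed_time);
}
generated quantities {
  // replicated data for posterior predictive checks, via poisson_log_rng
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * elapsed_time[n]);
}
```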

Posterior predictive check - Poisson Cumulative cases/Department

y_rep <- as.matrix(fit1, pars="y_rep")
ppc_dens_overlay(y = model.data$cases, y_rep[1:400,]) + 
  coord_cartesian(xlim = c(-1, 7000))

poisson_cumulative

The fit is not satisfactory; this is probably due to overdispersion. We can check the residuals to confirm this hypothesis.

ppc_ecdf_overlay(model.data$cases, y_rep[1:200,])

poisson_cumulative_ecdf

Residual check - Poisson Cumulative cases/Department

#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

poisson_cumulative_res

The variance of the residuals increases as the predicted value increases. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicate approximate \(95\%\) error bounds).

The plot of the standardized residuals indicates a large amount of overdispersion.
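As a toy illustration on simulated counts (not the project's data), the Pearson dispersion statistic makes this kind of diagnosis concrete: for counts that truly follow a Poisson model it should be close to 1, while overdispersed counts push it well above 1.

```r
# Toy sketch: fit a Poisson model to deliberately overdispersed counts and
# compute the Pearson dispersion statistic (residual chi-square / df).
set.seed(7)
y <- rnbinom(200, mu = 10, size = 2)   # NB counts: Var = 10 + 10^2/2 = 60 >> mean 10
fit <- glm(y ~ 1, family = poisson())
dispersion <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
dispersion  # well above 1, signalling overdispersion
```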

Intervals - Poisson Cumulative cases/Department

poisson_cumulative_intervals

Accuracy - Poisson Cumulative cases/Department

poisson_cumulative_acc

Poisson model for New cases/day

Same model, but now \(y_i\) represents New cases/day.

Posterior predictive check - Poisson New cases/day

y_rep <- as.matrix(fit1.1, pars="y_rep")
ppc_dens_overlay(y = model.data.cases$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 1500))

poisson_cases

ppc_ecdf_overlay(model.data.cases$cases, y_rep[1:200,])

poisson_cases_ecdf

Residual check - Poisson New cases/day

#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data.cases$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual, ylim=c(-5, 10)) + hline_at(2) + hline_at(-2)

poisson_cases_res

Intervals - Poisson New cases/day

poisson_cases_intervals

Accuracy - Poisson New cases/day

poisson_cases_acc

Negative Binomial model for Cumulative cases/Department

We try to improve on the previous model by using a Negative Binomial model:

\[ \ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathrm{NegBinomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]

where the parameter \(\phi\), called the precision, is such that:

\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]

again with \(i=1,\dots,134\). As \(\phi \rightarrow \infty\), the Negative Binomial distribution approaches the Poisson distribution.
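This limit can be checked numerically (a standalone sketch, unrelated to our fits): in R's parameterization dnbinom(..., mu, size), the size argument plays the role of \(\phi\), and the pmf converges to the Poisson pmf as it grows.

```r
# Sketch: the NB(lambda, phi) pmf approaches the Poisson(lambda) pmf as phi grows.
lambda <- 5
gaps <- sapply(c(1, 100, 1e6), function(phi) {
  max(abs(dnbinom(0:50, mu = lambda, size = phi) - dpois(0:50, lambda)))
})
gaps  # shrinks towards 0 as phi increases
```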

The Stan functions that we use here are neg_binomial_2_log_rng, to specify the distribution of \(y_i\), and neg_binomial_2_log_lpmf for the likelihood.

Posterior predictive check - NB Cumulative cases/Department

samples_NB <- rstan::extract(fit2)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 6000))

NB_cumulative

ppc_ecdf_overlay(model.data$cases, y_rep)

NB_cumulative_ecdf

Residual check - NB Cumulative cases/Department

mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

NB_cumulative_res

The situation is better now, but we still have too many residuals outside the \(95\%\) interval.

Intervals - NB Cumulative cases/Department

ppc_intervals(
  y = cases_dep$`Cumulative cases/Department`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

NB_cumulative_intervals

Accuracy across departments - NB Cumulative cases/Department

ppc_stat_grouped(
  y = model.data$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

NB_cumulative_acc

We should take into account the differences across departments.

Negative Binomial model for New cases/day

Posterior predictive check - NB New cases/day

samples_NB <- rstan::extract(fit2.1)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data.cases$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 1000))

NB_cases

NB_cases_ecdf

Residual check - NB New cases/day

mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

NB_cases_res

The situation is better now, but we still have too many residuals outside the \(95\%\) interval.

Intervals - NB New cases/day

ppc_intervals(
  y = cases_dep$`New cases/day`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

NB_cases_intervals

Accuracy across departments - NB New cases/day

ppc_stat_grouped(
  y = model.data.cases$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

NB_cases_acc

Multilevel Negative Binomial regression for Cumulative/Department

We try to fit the following model, which also includes Age as a covariate:

\[ \ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i \\ y_i \sim \mathrm{NegBinomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]

Posterior predictive check - multi NB Cumulative/Department

samples_NB2 <- rstan::extract(fit3)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 6000))

NB2_cumulative

ppc_ecdf_overlay(cases_dep$`Cumulative cases/Department`, y_rep)

NB2_cumulative_ecdf

Residual check - multi NB Cumulative/Department

mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

NB2_cumulative_res

Intervals - multi NB Cumulative/Department

ppc_intervals(
  y = cases_dep$`Cumulative cases/Department`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

NB2_cumulative_intervals

Accuracy across departments - multi NB Cumulative/Department

ppc_stat_grouped(
  y = model.data2$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

NB2_cumulative_acc

Multilevel Negative Binomial Regression for New cases/day

Posterior predictive check - multi NB New cases/day

samples_NB2 <- rstan::extract(fit3.1)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2.cases$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 750))

NB2_cases

ppc_ecdf_overlay(cases_dep$`New cases/day`, y_rep)

NB2_cases_ecdf

Residual check - multi NB New cases/day

mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

NB2_cases_res

Intervals - multi NB New cases/day

ppc_intervals(
  y = cases_dep$`New cases/day`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

NB2_cases_intervals

Accuracy across departments - multi NB New cases/day

ppc_stat_grouped(
  y = model.data2.cases$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

NB2_cases_acc

Hierarchical model for Cumulative/Department

In order to improve the fit, we fit a model with a department-specific intercept term.

So the varying-intercept model that we consider is now:

\[ \ln(\lambda_{i,d}) = \alpha_d + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathrm{NegBinomial}(\lambda_{i,d}, \phi) \]

The priors used for the above model are the following:

\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]

where \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).

New dataset

We added the following covariates to the dataset:

  • People: millions of inhabitants for each region;

  • Surface: \(km^2\), extent of each region;

  • Density: \(\frac{people}{km^2}\), density of the population in each region.

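A hedged Stan sketch of this varying-intercept Negative Binomial model (all data names and block structure are our assumptions, not the authors' file) could be:

```stan
data {
  int<lower=1> N;                 // observations
  int<lower=1> D;                 // departments
  int<lower=1, upper=D> dep[N];   // department index per observation
  vector[N] elapsed_time;
  vector[N] age;
  vector[D] pop;
  vector[D] surface;
  vector[D] density;
  int<lower=0> y[N];
}
parameters {
  real mu;
  vector[D] alpha;                // department-specific intercepts
  real beta_time;
  real beta_age;
  vector[3] psi;                  // [beta_pop, beta_sur, beta_dens]
  real<lower=0> sigma_alpha;
  real<lower=0> inv_phi;
}
model {
  beta_time ~ normal(0.5, 1);
  beta_age ~ normal(0, 1);
  psi ~ normal(0, 1);
  // intercepts partially pooled towards a department-level regression
  alpha ~ normal(mu + psi[1] * pop + psi[2] * surface + psi[3] * density,
                 sigma_alpha);
  y ~ neg_binomial_2_log(alpha[dep] + beta_time * elapsed_time
                         + beta_age * age, 1 / inv_phi);
}
```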

Posterior predictive check - hierarchical NB Cumulative/Department

samples_hier <- rstan::extract(fit.4)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 6000))

hier_cumulative

ppc_ecdf_overlay(cases_dep$`Cumulative cases/Department`, y_rep)

hier_cumulative_ecdf

Residual check - hierarchical NB Cumulative/Department

mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)

hier_cumulative_res

Very few points are now outside the \(95\%\) confidence interval.

Intervals - hierarchical NB Cumulative/Department

ppc_intervals(
  y = cases_dep$`Cumulative cases/Department`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

hier_cumulative_intervals

Accuracy across departments - hierarchical NB Cumulative/Department

ppc_stat_grouped(
  y = data.hier.NB.complete$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

hier_cumulative_acc

We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.

Hierarchical model for New cases/day

Posterior predictive check - hierarchical NB New cases/day

samples_hier <- rstan::extract(fit.4.1)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete.cases$cases, y_rep[1:200,]) + 
  coord_cartesian(xlim = c(-1, 500))

hier_cases

ppc_ecdf_overlay(cases_dep$`New cases/day`, y_rep)

hier_cases_ecdf

Residual check - hierarchical NB New cases/day

mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete.cases$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual, ylim = c(-3, 4)) + hline_at(2) + hline_at(-2)

hier_cases_res

Very few points are now outside the \(95\%\) confidence interval.

Intervals - hierarchical NB New cases/day

ppc_intervals(
  y = cases_dep$`New cases/day`, 
  yrep = y_rep,
  x = cases_dep$`Elapsed time`
) + 
  labs(x = "Days", y = "Cases")

hier_cases_intervals

Accuracy across departments - hierarchical NB New cases/day

ppc_stat_grouped(
  y = data.hier.NB.complete.cases$cases,
  yrep = y_rep,
  group = cases_dep$Department,
  stat = "median",
  binwidth = 0.2
)

hier_cases_acc

We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.

LOOIC

Leave-One-Out cross-validation (LOO) is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model, using the log-likelihood evaluated at the posterior simulations of the parameter values. LOOIC is the corresponding information criterion: lower values indicate better expected predictive performance.
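As a self-contained toy illustration (not our model fits), the loo package computes LOOIC from a matrix of pointwise log-likelihood draws, one row per posterior draw and one column per observation:

```r
# Toy sketch: LOOIC from a simulated pointwise log-likelihood matrix.
# The Poisson/Gamma conjugate setup here is only for illustration.
library(loo)
set.seed(42)
y <- rpois(25, lambda = 4)                                       # toy data
draws <- rgamma(500, shape = sum(y) + 1, rate = length(y) + 1)   # posterior draws of lambda
log_lik <- sapply(y, function(yi) dpois(yi, draws, log = TRUE))  # 500 x 25 matrix
res <- loo(log_lik, r_eff = rep(1, ncol(log_lik)))
res$estimates["looic", "Estimate"]  # lower is better
```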

We plot the LOOIC values to compare the models:

loo.all.deps<-c(loo.model.Poisson.cases[3], loo.model.NB.cases[3], loo.model.NB2.cases[3], loo.model.NB.hier.cases[3])

sort.loo.all.deps<- sort.int(loo.all.deps, index.return = TRUE)$x

par(xaxt="n")
plot(sort.loo.all.deps, type="b", xlab="", ylab="LOOIC", main="Model comparison")
par(xaxt="s")
axis(1, c(1:4), c("Poisson", "NB-sl", "NB-ml", 
                  "hier")[sort.int(loo.all.deps,
                    index.return = TRUE)$ix],
                    las=2)

looic